In this paper, we present two symbiotic optimizations for recursive task parallel (RTP) programs that reduce
task creation and termination overheads. Our first optimization, Aggressive Finish-Elimination (AFE), removes a
large fraction of the redundant join operations. The second optimization, Dynamic Load-Balanced loop Chunking
(DLBC), extends prior work on loop chunking to decide the number of parallel tasks at runtime, based on the
number of available worker threads. Further, we discuss the impact of exceptions on our optimizations and extend
them to handle RTP programs that may throw exceptions. We implemented DCAFE (= DLBC + AFE) in the X10v2.3
compiler and tested it over a set of benchmark kernels on two different hardware systems (a 16-core Intel system
and a 64-core AMD system). With respect to the base X10 compiler extended with the loop chunking of Nandivada
et al. [Nandivada et al.(2013)Nandivada, Shirako, Zhao, and Sarkar] (LC), DCAFE achieved a geometric mean
speedup of 5.75x and 4.16x on the Intel and AMD systems, respectively. We also present an evaluation of energy
consumption on the Intel system and show that, on average, the DCAFE versions consume 71.2% less energy than
the LC versions.
1. INTRODUCTION

The onset of multi-core architectures has brought forth a shift in programming paradigm from
sequential programs to parallel programs. This shift has led to an increased interest in task
parallel languages, such as X10 [Saraswat et al.(2012)Saraswat, Bard, Igor, Tardieu, and Grove],
Chapel [Chamberlain et al.(2007)Chamberlain, Callahan, and Zima], OpenMP [OpenMP(2008)],
HJ [Cave et al.(2011)Cave, Zhao, Shirako, and Sarkar], and so on. These languages allow the
programmer to express the desired amount of parallelism (a.k.a. ideal parallelism), while delegating
the task of extracting the useful parallelism to the compiler (or runtime). In this paper, we present
two novel compiler optimizations (targeting recursive task parallel programs) to extract useful
parallelism from the ideal.
    def find_queens() {

      nqueens(n, 0, ...); }
    def nqueens(val n:Int, val j:Int, ...) {
      finish {
        for(var i:Int=0; i<n; i++) {
          async {
            ... /* Checking if none of the queens conflict */
            nqueens(n, j+1, ...); } } } }

(a)

    def nqueens(val n:Int, val j:Int, ...) {
      var nChunks:Int=Runtime.retNthreads();
      var chunkSize:Int=(n+nChunks-1)/nChunks;
      finish {
        for(var ii:Int=0; ii<n; ii+=chunkSize) {
          val ni = ii;
          async {
            var kx:Int = ni+chunkSize;
            if(kx>n) kx=n;
            for(var i:Int=ni; i<kx; i++) {
              ... /* Checking if none of the queens conflict */
              nqueens(n, j+1, ...); } } } } }

(b)

Fig. 1: BOTS Nqueens kernel in X10: (a) Unoptimized version (b) Loop Chunked version of the
nqueens function.


Recursive Task Parallel (RTP) programs constitute an important subset of task parallel programs.
In RTP programs, the parent task spawns a set of child tasks, which in turn can recursively spawn
further tasks. This renders the problem of extracting useful parallelism from the ideal quite
challenging in case of RTP programs (compared to non-RTP programs). We will use an example to
illustrate the same.

Figure 1(a) shows the snippet of the BOTS [Duran et al.(2009)Duran, Teruel, Ferrer, Martorell,
and Ayguade] Nqueens kernel, in X10. The async construct spawns a new child task to execute
the statement within its body, in parallel with the parent task. The finish construct acts as a
join point for all the tasks spawned in its body. The code in Figure 1(a) shows that the presence of
recursive task parallelism may lead to the execution of a large number of finish operations at
runtime (for example, when n=14, it executes 27 million finish operations). Prior work [Nandivada
et al.(2013)Nandivada, Shirako, Zhao, and Sarkar] shows that eliminating unnecessary finish
operations can lead to significant performance improvements. However, their proposed technique
does not lead to any reduction in the number of finish operations in this example. Interestingly,
we observe that each task spawns new child tasks and waits at the join point for the spawned
tasks to terminate. After that, the task simply returns from the procedure. Hence, this finish
construct can be pulled out of the nqueens method and placed around its non-recursive call site (in
find_queens). In other words, the finish construct in the method nqueens can be declared
redundant (and hence removed) if we surround the non-recursive call to nqueens with a
finish construct. Such an optimization reduces the number of finish operations
to just one (for this code), which can lead to significant performance gains.
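
As an illustration, the following Python model (a sketch, not X10 and not part of the DCAFE implementation) counts the finish and async operations executed by a recursion shaped like Figure 1(a), before and after the finish is hoisted to the non-recursive call site. The helper names `conflicts` and `count_ops`, the explicit base case, and the assumption that every invocation below the last row executes one finish around n asyncs are ours.

```python
def conflicts(placed, col):
    """Return True if a queen in column `col` of the next row attacks a placed queen."""
    row = len(placed)
    return any(c == col or abs(c - col) == row - r for r, c in enumerate(placed))


def count_ops(n, hoisted):
    """Model the recursion of Figure 1(a) and count join/spawn operations.

    hoisted=False: every invocation that runs the loop executes its own finish.
    hoisted=True:  a single finish surrounds the non-recursive call (AFE-style).
    """
    counts = {"finish": 1 if hoisted else 0, "async": 0, "solutions": 0}

    def nqueens(placed):
        if len(placed) == n:          # all rows filled: one solution found
            counts["solutions"] += 1
            return
        if not hoisted:
            counts["finish"] += 1     # finish wrapping the loop of this invocation
        for col in range(n):
            counts["async"] += 1      # one async per loop iteration
            if not conflicts(placed, col):
                nqueens(placed + [col])

    nqueens([])
    return counts
```

In this model the number of asyncs is unchanged by the hoisting, while the number of finish operations drops from one per invocation to exactly one.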

In general, it is not trivial to pull out finish constructs, as they may be nested deep inside some
if/while constructs. The problem becomes even more challenging if the input code may throw
exceptions. We address these challenges in the first optimization we propose in this paper (called
Aggressive Finish-Elimination, or AFE in short). AFE helps to eliminate redundant finish
operations in RTP programs, in a semantics preserving way.
Further analysis of Figure 1(a) shows that, at every level of recursion, the nqueens function
creates a number of asyncs (tasks), leading to an explosion of tasks (for example, when n=14,
it creates a total of 377 million tasks), which in turn results in large performance overheads. The
powerful scheme of Loop Chunking [Nandivada et al.(2013)Nandivada, Shirako, Zhao, and Sarkar]
(henceforth referred to as LC) helps to extract useful parallelism from the ideal. LC splits the
iterations of a large parallel loop into a set of chunks, where each chunk (containing a set of serial
iterations) executes in parallel.

Figure 1(b) presents the LC version of the nqueens function. Here, the call to the
Runtime.retNthreads function returns the initial count of the worker threads. Hence, the
useful parallelism is bound by nChunks. Considering this, LC ensures that at most nChunks
tasks are created in any invocation of this function. Thus, at level k of recursion, it
creates nChunks^k tasks (for example, when n=14 and nChunks=8, it creates 189 million
tasks). This chunked program runs faster than the unoptimized version, but still incurs a large task
creation and termination overhead. This is because the chunking algorithm is oblivious to the
recursive call inside the loop and hence permits the spawning of a large number of tasks. We have
observed a similar trend in a number of RTP kernels present in two open-sourced benchmark suites:
IMSuite [Gupta and Nandivada(2015)] and BOTS.

The main reason for the LC version to incur the large overheads is that it does not exploit the
underlying recursive nature of the task parallel program and misses significant opportunities to
optimize such programs. To address this challenge, we propose our second optimization “Dynamic
Load-Balanced loop Chunking” (DLBC), as an extension to LC. DLBC generates code that spawns
new tasks (to execute some iterations of a loop) only if “idle” workers are available at runtime.
Otherwise, the current worker executes the loop serially. During the serial execution, if some workers
get freed up, the remaining loop iterations may be executed in parallel (by the available workers).
Our transformation leads to a significant reduction in the number of tasks created: for example, for
Figure 1(a), when n=14, our transformed code creates 6 million tasks (about 30x fewer than the
LC version).

Realising the above mentioned extensions requires multiple design choices (e.g., how to identify 
the number of available workers, how to divide work among the current and available workers, when 
to execute the code in the serial mode, when to switch back to parallel execution, and so on), that 
are non-trivial in nature. We studied many different design alternatives and designed DLBC using 
the best available choices. 

Our Contributions 

• We propose two symbiotic optimizations, AFE and DLBC, that improve the performance of RTP
programs by reducing the redundant join and task creation operations. DCAFE (= DLBC + AFE)
can be easily extended to other task parallel languages (such as HJ, Chapel and OpenMP) that have
similar constructs for task creation and task termination operations.

• We present an extension to the X10v2.3 compiler that implements DCAFE.

• We extend DCAFE to perform semantics preserving code transformation even in the presence of
exceptions.

• We evaluated DCAFE over 8 benchmarks (drawn from two benchmark suites: IMSuite and BOTS)
on two different hardware systems (a 16-core Intel system and a 64-core AMD system). We show
that DCAFE leads to improved execution times (geometric mean of 5.75x on the Intel and 4.16x
on the AMD system, with respect to the LC version; and geometric mean of 12.64x on the Intel
and 5.25x on the AMD system, with respect to the unoptimized version).

• We also show that the use of DCAFE leads to significantly lower energy consumption on the
Intel system. On average, DCAFE optimized codes consume 0.288x the energy consumed by LC
optimized codes, and 0.19x the energy consumed by the unoptimized codes.

Organization: Section 2 presents a brief background of some of the topics pertinent to this paper.
Section 3 discusses the details of AFE and DLBC. In Section 4, we present relevant changes to these
transformations in the presence of exceptions. We then present an evaluation of DCAFE.
1. Loop-Finish Interchange

    for (S1; c; S2) { finish S3 }   =>   S1; finish { for (; c; S2) { S3 } }

    // Say Es = set of e-asyncs in S3
    // ¬∃e ∈ Es: c has dependence on e.
    // ¬∃e ∈ Es: e has loop carried dependence on S2, c or S3

2. Finish Fusion

    finish {S1} finish {S2}   =>   finish {S1; S2}

    // S2 has no dependence on any e-async of S1.

3. Tail Finish Elimination (Simplified)

    finish finish S1   =>   finish S1

Fig. 2: Existing mini-transformations.

We then discuss some of the salient aspects of our work and the related work in the sections that
follow the evaluation, and finally conclude.


2. BACKGROUND 
2.1. X10 

We briefly describe the X10 constructs relevant to this manuscript (see the language
manual [Saraswat et al.(2012)Saraswat, Bard, Igor, Tardieu, and Grove] for details). “async S1”
spawns a new asynchronous task to execute S1. A task can be registered on one or more clocks;
“async clocked (c1, c2) S” registers the newly spawned task on the clocks c1 and c2. Such
a task, on executing Clock.advanceAll(), waits for all the tasks registered on c1 and/or c2 to
execute the barrier Clock.advanceAll(). “finish S1” waits for all the tasks spawned in
S1 to terminate. In X10, each async has a unique Immediately Enclosing Finish (IEF) at runtime.
Note: statically, an async may have multiple IEFs.

During execution, when an exception is thrown in an async, it is caught by its IEF. The enclosing
finish waits for termination of the remaining tasks, then packages all the thrown exceptions
as a MultipleExceptions object and throws it again. Note: an exception that occurs in one
task (async) does not terminate the sibling tasks.
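
The finish/exception behaviour described above can be mimicked with a small Python sketch. This is only a model built on Python threads, not the X10 runtime: the class `Finish`, its method `async_`, and the Python class `MultipleExceptions` are hypothetical names chosen to mirror the X10 constructs.

```python
import threading


class MultipleExceptions(Exception):
    """Bundle of all exceptions thrown by the tasks joined at one finish."""

    def __init__(self, exceptions):
        super().__init__(f"{len(exceptions)} task(s) failed")
        self.exceptions = list(exceptions)


class Finish:
    """Context manager that joins every task spawned through async_()."""

    def __init__(self):
        self._threads = []
        self._errors = []
        self._lock = threading.Lock()

    def async_(self, fn, *args):
        def run():
            try:
                fn(*args)
            except Exception as exc:       # a failing task does not stop its siblings
                with self._lock:
                    self._errors.append(exc)

        thread = threading.Thread(target=run)
        with self._lock:
            self._threads.append(thread)
        thread.start()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        i = 0
        while i < len(self._threads):      # also joins tasks spawned by other tasks
            self._threads[i].join()
            i += 1
        if isinstance(exc, Exception):     # exception thrown by the finish body itself
            self._errors.append(exc)
        if self._errors:
            raise MultipleExceptions(self._errors)
        return False
```

A body such as `with Finish() as f: f.async_(work, 1); f.async_(work, 2)` returns only after both tasks have terminated; if any task raised, the caller sees a single `MultipleExceptions` carrying all of them.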

The X10 runtime is built around the notion of workers. Each worker is assigned a task to execute and
can be seen as a software thread. The initial count of workers can be set at runtime (typically to the
number of available cores), using the environment variable X10_NTHREADS. During execution,
the X10 runtime also tracks the number of idle workers - workers which are assigned no task.


2.2. Finish Elimination 

Finish Elimination [Nandivada et al.(2013)Nandivada, Shirako, Zhao, and Sarkar] helps to remove
redundant finish constructs - finish constructs that do not contain e-asyncs. For a statement
S, all the async statements (within S) whose IEF is not enclosed within S are called escaping
asyncs or e-asyncs [Guo et al.(2009)Guo, Barik, Raman, and Sarkar] of S. The ‘Finish Elimination’
optimization repeatedly applies a series of transformations to eliminate the redundant finish
constructs. Three of their proposed mini-transformations (Loop-Finish Interchange, Finish
Fusion and a simplified version of Tail Finish Elimination) are relevant to this work. For the sake of
completeness, we reproduce these rules in Figure 2. Each transformation may include a set of
pre-conditions (shown as comments) necessary to ensure a semantics preserving transformation.
Loop-Finish Interchange is feasible when there is neither a loop carried dependence between the
iterations of the loop, nor a dependence of the loop condition on the e-asyncs of S3. This rule can
be trivially extended to other looping constructs such as while and do-while. Finish Fusion merges
two finish statements if S2 has no dependence on the e-asyncs of S1. Tail Finish Elimination
eliminates the trivially redundant finish constructs.
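
To make two of these rules concrete, the sketch below applies Finish Fusion and Tail Finish Elimination to a toy program representation in Python. The tuple-based AST (`("finish", body)`, `("seq", [...])`, `("stmt", name)`) and the function names are assumptions made for illustration; the dependence pre-conditions of Figure 2 are assumed to have been checked already and are not modelled.

```python
def seq(*stmts):
    """Build a flat sequence node, splicing in nested sequences."""
    flat = []
    for s in stmts:
        flat.extend(s[1] if s[0] == "seq" else [s])
    return ("seq", flat)


def tail_finish_elimination(node):
    """finish finish S1  =>  finish S1"""
    if node[0] == "finish" and node[1][0] == "finish":
        return tail_finish_elimination(node[1])
    return node


def finish_fusion(node):
    """finish {S1} finish {S2}  =>  finish {S1; S2} for adjacent finishes."""
    if node[0] != "seq":
        return node
    out = []
    for s in node[1]:
        if out and out[-1][0] == "finish" and s[0] == "finish":
            out[-1] = ("finish", seq(out[-1][1], s[1]))   # merge into previous finish
        else:
            out.append(s)
    return out[0] if len(out) == 1 else ("seq", out)
```

For example, `finish_fusion(seq(("finish", S1), ("finish", S2)))` yields a single finish whose body is the sequence S1; S2.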
Fig. 3: Block diagram of DCAFE


2.3. Energy Measurements (Intel specific) 

Running Average Power Limit (RAPL) [Intel(2014)] is an interface that exposes the Machine
Specific Registers (MSRs) to the user application. MSRs facilitate the measurement of the energy
consumed by different components of the CPU. The MPES register (MSR_PP0_ENERGY_STATUS) in
MSRs stores the total energy consumed by all the cores in a node. We have implemented a function
read_msr() to read this register in our generated code. We could not find a similar interface for
our AMD system.
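
For readers unfamiliar with RAPL, the following Python sketch shows one way such a counter can be read and converted to joules on Linux. It is not the read_msr() function of our generated code: it assumes the Linux `msr` driver (`/dev/cpu/<n>/msr`, typically requiring root), the register addresses 0x606 (MSR_RAPL_POWER_UNIT) and 0x639 (MSR_PP0_ENERGY_STATUS) from the Intel manuals, and a 32-bit wrapping energy counter.

```python
import os
import struct

MSR_RAPL_POWER_UNIT = 0x606      # holds the energy status unit (ESU) in bits 12:8
MSR_PP0_ENERGY_STATUS = 0x639    # cumulative energy counter for the cores


def read_msr(register, cpu=0):
    """Read one 64-bit MSR through the Linux msr driver (assumed to be loaded)."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        return struct.unpack("<Q", os.pread(fd, 8, register))[0]
    finally:
        os.close(fd)


def energy_unit_joules(power_unit_raw):
    """One counter increment corresponds to 1 / 2**ESU joules."""
    esu = (power_unit_raw >> 8) & 0x1F
    return 1.0 / (1 << esu)


def energy_delta_joules(before_raw, after_raw, unit):
    """Energy consumed between two readings, tolerating one 32-bit wrap-around."""
    return ((after_raw - before_raw) & 0xFFFFFFFF) * unit
```

A measurement would read MSR_PP0_ENERGY_STATUS before and after the kernel of interest and pass both raw values, together with the unit derived from MSR_RAPL_POWER_UNIT, to `energy_delta_joules`.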


3. TRANSFORMATION SCHEME 

In this section, we discuss two novel optimizations: Aggressive Finish-Elimination (AFE) and
Dynamic Load-Balanced loop Chunking (DLBC). We propose a new compiler optimization phase
called DCAFE (= DLBC + AFE) that combines these two optimizations. DCAFE (overall block
diagram shown in Figure 3) starts by performing a simple may-happen-in-parallel (MHP) dependence
analysis. For this work, we perform an inter-procedural MHP analysis, as an extension to that of
Agarwal et al. [Agarwal et al.(2007)Agarwal, Barik, Sarkar, and Shyamasundar]. The MHP analysis
is used to compute the may-happen-before dependence (MHBD) [Nandivada et al.(2013)Nandivada,
Shirako, Zhao, and Sarkar] information. After the MHP analysis, DCAFE invokes the AFE and DLBC
optimizations, before doing the code generation. For the sake of simplicity, in this section, we
assume that the programs do not throw exceptions. In Section 4, we extend our proposed optimizations
to do semantics preserving transformations of X10 programs that may throw exceptions.


3.1. Aggressive Finish-Elimination (AFE) 

AFE aims at eliminating redundant finish constructs and expanding the scope of finish
operations, if possible. AFE consists of eight mini-transformations that aim to pull out finish
constructs from different methods to their respective call-sites. Three of these mini-transformations
have been proposed by Nandivada et al. [Nandivada et al.(2013)Nandivada, Shirako, Zhao, and Sarkar]
(Figure 2). The remaining five transformations (Async-Finish Interchange, Finish-If Interchange, Finish
Expansion Upper, Finish Expansion Lower, and Finish-Method Pull), shown in Figure 4, are new.
The necessary pre-conditions for any mini-transformation are specified as comments.
1. Finish-If Interchange

    if(e) {finish S1}   =>   v=e; finish {if(v) S1}

2. Finish Expansion Upper

    S1; finish {S2}   =>   finish {S1; S2}

    // S1 has no e-asyncs registered on clocks.

3. Finish Expansion Lower

    finish {S1}; S2   =>   finish {S1; S2}

    // Say Es = set of e-asyncs in S1
    // ¬∃e ∈ Es: S2 has dependence on e.
    // S2 is not a barrier; S2 has no e-asyncs registered on clocks.

4. Async-Finish Interchange

    async finish S1   =>   finish { async S1 }

5. Finish-Method Pull

    def f2() { foo(); }             def f2() {
                                      finish { foo(); } }
    def foo() {               =>
      finish S1; }                  def foo() {
                                      S1 }

    // If Finish-Method Pull has not already been applied on foo().

Fig. 4: Mini Transformations to facilitate AFE


Finish-If Interchange pulls out a finish construct from the surrounding if construct. A special
case handling the if-then-else statement is shown below:

    if(cond)                 v = cond;
      finish S1       =>     finish {
    else                       if(v) S1
      finish S2                else S2 }

The switch-case statement is also handled similarly. Finish Expansion Upper expands the finish scope
by pulling a preceding statement S1 into its scope. It requires that S1 does not have any e-asyncs registered
on clocks. Finish Expansion Lower expands the scope of the finish construct by pulling in a succeeding
statement S2. It requires that (i) there is no dependence between S2 and the e-asyncs of S1, (ii) S2 is
not a barrier, and (iii) S2 does not have any e-asyncs registered on clocks. The Async-Finish Interchange
interchanges the surrounding async and the inner finish. In conjunction with other transformation rules,
this rule helps to increase the scope of finish. Finish-Method Pull lifts a finish construct from a method
to all its possible callers (obtained by a conservative flow analysis).
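
The rewrites of Figure 4 can be sketched as functions over a toy program representation. The Python below is an illustration only: the tuple-based AST, the node tags (`"if"`, `"finish"`, `"async"`, `"assign"`, `"seq"`, `"stmt"`) and the function names are hypothetical, Finish Expansion Upper and Lower are folded into one helper, and the clock and dependence pre-conditions are assumed to hold rather than checked.

```python
def finish_if_interchange(node, tmp="v"):
    """if (e) { finish S1 }  =>  v = e; finish { if (v) S1 }"""
    if node[0] == "if" and node[2][0] == "finish":
        _, cond, (_, body) = node
        return ("seq", [("assign", tmp, cond), ("finish", ("if", tmp, body))])
    return node


def async_finish_interchange(node):
    """async finish S1  =>  finish { async S1 }"""
    if node[0] == "async" and node[1][0] == "finish":
        return ("finish", ("async", node[1][1]))
    return node


def finish_expansion(node):
    """S1; finish {S2}; S3  =>  finish {S1; S2; S3} (Upper and Lower combined)."""
    if node[0] == "seq" and sum(s[0] == "finish" for s in node[1]) == 1:
        body = []
        for s in node[1]:
            inner = s[1] if s[0] == "finish" else s      # unwrap the single finish
            body.extend(inner[1] if inner[0] == "seq" else [inner])
        return ("finish", ("seq", body))
    return node
```

Each function returns its input unchanged when the pattern does not match, so the rules can be tried repeatedly in any order.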

The mini-transformations presented in Figures 2 and 4 can be categorized under two heads: (a) main rules:
transformations to eliminate redundant finish constructs, and (b) helper rules: transformations to expose
opportunities for applying main rules. For example, Rules #2 and #3 (in Figure 2) reduce the static finish
operations, and Rule #1 (in Figure 2) and Rule #5 (in Figure 4) can reduce the dynamic finish operations;
these rules fall in the category of main rules. Rules #1-#5 in Figure 4 are examples of helper rules. Note:
i) Rule #5 is both a ‘main’ rule and a ‘helper’ rule; ii) the listed transformations can be applied in any order.

We start applying AFE on the leaf nodes of the call graph and continue applying AFE on their parent nodes 
(by avoiding the already visited nodes in the call graph, to take care of cycles due to recursion). This process 
continues till one of the following scenarios is reached: (a) finish construct has been pushed to the main 
method - and no further processing of code is required, or (b) finish construct cannot be pulled out of the 
method, due to dependences, and thus partial rollback takes place: We follow a simple all or nothing strategy 
for expanding the scope of a finish. If the finish construct cannot be pulled out of a method then the 
method is reverted to its original state (the state just before the AFE is applied on this method). 

AFE is guaranteed to halt because (a) the number of call sites is finite, (b) every method is processed only
once, and (c) no finish constructs are added to the recursive call sites.
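
The bottom-up, visit-once processing order can be sketched as a post-order walk of the call graph. The Python below is an assumption about one reasonable realization, not the compiler's actual traversal: `call_graph` is a hypothetical mapping from a method name to the methods it calls, and the visited set is what keeps recursive cycles from being re-processed.

```python
def afe_processing_order(call_graph, root):
    """Return methods in the order AFE would process them: callees before callers."""
    visited = set()
    order = []

    def visit(method):
        if method in visited:        # already handled, or a recursive cycle
            return
        visited.add(method)
        for callee in call_graph.get(method, ()):
            visit(callee)
        order.append(method)         # process a method after all its callees

    visit(root)
    return order
```

For the Nqueens example, `afe_processing_order({"main": ["find_queens"], "find_queens": ["nqueens"], "nqueens": ["nqueens"]}, "main")` processes nqueens first, then find_queens, then main.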

Sample Transformation:

We now present the working of AFE on the input code shown in Figure 5(a). Assume that S1, S2, S3, and S4
have no e-asyncs. Figures 5(b-h) show the effect of applying AFE on the input code. AFE starts by applying
Finish Fusion (Figure 5(b)), followed by Finish-If Interchange (Figure 5(c)). Then, it applies Async-Finish
Interchange (Figure 5(d)) and Loop-Finish Interchange (Figure 5(e)), followed by Tail Finish Elimination
(Figure 5(f)). Next, it applies Finish Expansion Upper (Figure 5(g)) followed by Finish Expansion Lower to
obtain the code in Figure 5(h).

    // Example Code
    S1;
    finish {
      for (i in 0..n) {
        async {
          if (cond) {
            finish S2
            finish S3
    } } } }
    S4;

(a)

    // After Finish Fusion
    S1;
    finish {
      for (i in 0..n) {
        async {
          if (cond) {
            finish {
              S2;
              S3
    } } } } }
    S4

(b)

    // After Finish-If Interchange
    S1;
    finish {
      for (i in 0..n) {
        async {
          finish {
            if (cond) {
              S2; S3
    } } } } }
    S4

(c)

    // After Async-Finish Interchange
    S1;
    finish {
      for (i in 0..n) {
        finish {
          async {
            if (cond) {
              S2; S3
    } } } } }
    S4

(d)

    // After Loop-Finish Interchange
    S1;
    finish {
      finish {
        for (i in 0..n) {
          async {
            if (cond) {
              S2; S3
    } } } } }
    S4

(e)

    // After Tail Finish Elimination
    S1;
    finish {
      for (i in 0..n) {
        async {
          if (cond) {
            S2;
            S3
    } } } }
    S4

(f)

    // After Finish Expansion Upper
    finish {
      S1;
      for (i in 0..n) {
        async {
          if (cond) {
            S2;
            S3;
    } } } }
    S4

(g)

    // After Finish Expansion Lower
    finish {
      S1;
      for (i in 0..n) {
        async {
          if (cond) {
            S2;
            S3
      } } }
      S4 }

(h)

Fig. 5: Applying AFE on a running example

3.2. Dynamic Load-Balanced loop Chunking 

The existing loop-chunking (LC) optimization [Nandivada et al.(2013)Nandivada, Shirako, Zhao, and Sarkar]
suffers from the drawback that it may create tasks even when there are no idle workers at runtime. This may
lead to significant overheads (especially in case of RTP programs, where it is common to have many tasks
created at each level of recursion). Our proposed Dynamic Load-Balanced loop Chunking (DLBC) addresses
this drawback through two simple, yet effective strategies: (i) dynamic task creation based on the number of
idle workers and load balancing among the workers, and (ii) serial execution if no idle workers are available.

3.2.1. Dynamic task creation and load balancing. One main drawback of LC is that it doesn’t distribute 
the work equally among the available workers. Our chunking policy aims at balancing the load through two 
simple techniques: (a) dividing the work equally among all the idle workers, and (b) sparing some work for the 
current worker (worker executing the current task). 

To highlight the unbalanced load distribution inherent in LC, consider the code shown in Figure 1. Figure 1(b)
shows the code after invoking LC on Figure 1(a). In Figure 1(b), consider n=12 and number of total workers
= nChunks = 4. Thus, chunkSize is equal to 3, and we create four tasks (to execute three iterations each).
Say, excluding the current worker, the other three workers are currently idle. In such a scenario, two of the
idle workers execute one task each, and the other idle worker executes two tasks (six iterations), while the
current worker waits at the join point for the spawned tasks to terminate.

In contrast, our chunking policy distributes the iterations equally among all the four workers (including the
current worker), which gives better load balancing. Further, the current worker gets some useful work to
perform before waiting at the join point. Importantly, if n = 10, our scheme provides two iterations each to
the current worker and an idle worker, and three iterations each to the remaining two workers. Thus, our policy
ensures that the current worker not only does some useful work (before waiting at the join point), but also gets
the smallest chunk of iterations to execute.
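
The arithmetic of the two policies can be checked with a short Python sketch. It is a model of the chunk-size computation only, under our own assumptions: the function names are hypothetical, and the current worker's share is tracked through an absolute start index (`parent_start`) so that a nonzero `start` is also handled.

```python
def lc_chunks(n, n_chunks):
    """Static loop chunking: ceil(n / n_chunks) iterations per spawned task."""
    size = (n + n_chunks - 1) // n_chunks
    return [min(start + size, n) - start for start in range(0, n, size)]


def dlbc_chunks(n, idle_workers, start=0):
    """Split iterations [start, n) among the idle workers and the current worker.

    Returns (chunk sizes given to idle workers, size of the current worker's share).
    """
    tot_workers = idle_workers + 1
    actual_n = n - start
    eq_chunk = actual_n // tot_workers          # minimum share of any worker
    parent_start = n - eq_chunk                 # current worker runs [parent_start, n)
    rem = actual_n % tot_workers + idle_workers
    spawned, ii = [], start
    while ii < parent_start:
        kx = ii + eq_chunk + rem // tot_workers  # leftover iterations go to early chunks
        rem -= 1
        spawned.append(kx - ii)
        ii = kx
    return spawned, n - parent_start
```

With n=12 and three idle workers, `dlbc_chunks(12, 3)` gives three spawned chunks of three iterations and three iterations for the current worker; with n=10 it gives spawned chunks of 3, 3 and 2 iterations and two iterations for the current worker, matching the distribution described above.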

We extend the X10 Runtime (XRX) with a function Runtime.retIdleWorkers() that returns the count
of idle workers at that instant, at runtime. Our implementation of retIdleWorkers() does not use any
atomics. So, in an RTP program, it is possible that two tasks may fetch the same value of idle workers at the
same instant. Thus, in practice, the number of tasks created by DLBC may be more than the number of idle
workers. But we show that the reduction in task creation is significant enough. Although the use of atomics
looks attractive, it leads to substantial overheads.

    def nqueens(val n:Int, val j:Int, ...) {
      var ii:Int=0;
      var workers:Int = Runtime.retIdleWorkers();
      outer: while(true) {
        if (workers>0) {
          val totWorkers:Int = workers + 1;
          val actualn:Int = n-ii;
          val eqChunk:Int = actualn/totWorkers;
          val newN:Int = actualn-eqChunk;
          var rem:Int = actualn%totWorkers+workers;
          finish {
            for( ; ii<newN; ) {                      // “chunked block”
              val kx = ii+eqChunk+rem/totWorkers;
              val ni=ii; rem--; ii = kx;
              async {
                for(var i:Int=ni; i<kx; i++) {
                  ... /* Checking if none of the queens conflict */
                  nqueens(n, j+1, ...);
              } } /* async */ } /* outer-for */
            {                                        // “parent block”
              for(var i:Int=newN; i<n; i++) {
                ... /* Checking if none of the queens conflict */
                nqueens(n, j+1, ...);
            } } } /* finish */ } /* if */
        else for(var i:Int=0; i<n; i++) {            // “serial block”
          ... /* Checking if none of the queens conflict */
          nqueens(n, j+1, ...);
          workers = Runtime.retIdleWorkers();
          if(workers>0 && i<n-2) {
            ii=i+1; continue outer;
        } } break; } /* while */ } /* nqueens */

Fig. 6: DLBC applied on BOTS Nqueens kernel

Overall, DLBC consists of five substeps (see Figure 3). It starts by invoking LC. The next step is to introduce
some template code that computes the current count of the idle workers and a set of five helper variables:
i) totWorkers: # idle workers+1, ii) eqChunk: minimum number of iterations executed by any worker,
iii) actualn: number of iterations of the parallel loop to be executed, iv) newN: total number of iterations
to be executed by the idle workers, and v) rem: a temporary variable. This substep also introduces an outer
while loop, which is used to avoid unstructured control flow. The third substep of DLBC
(Chunked-Block-Modification) modifies the chunked code to enforce the load balancing scheme discussed above.
Similarly, the Parent-Block-Generation step introduces code to be executed by the parent thread. For the input
code of Figure 1(a), Figure 6 shows the code generated by DLBC. The code computes the number of idle workers
and, if workers>0, the execution continues inside the if block, starting with the computation of totWorkers.
The finish body includes a chunked parallel loop (chunked-block: executed by the idle workers), and a serial
for-loop (parent-block: executed by the current worker).


3.2.2. Serial Execution. DLBC aims to create tasks only if there are idle workers. Ideally, if there are no
idle workers then a new task should not be spawned. In such cases the current task can be asked to complete
the remaining job serially. DLBC handles this scenario by using a simple heuristic: if, at the time of task
creation, the number of idle workers is zero, then the loop under consideration should be executed serially.
This heuristic is enforced by invoking the Serial-Block-Generation substep. This substep emits sequential code
to be executed when no idle workers are found. Considering the possibility that some workers may get freed up
during the life-time of this serial loop, the generated code checks for available idle workers after each iteration.
If idle workers are available, the rest of the iterations are divided into totWorkers (= number of idle
workers + 1) chunks to be executed in parallel.

The “serial block” in Figure 6 depicts the code generated by the ‘Serial-Block-Generation’ substep. An
interesting point to note is that, at the end of each serial iteration, we check the count of the idle workers. If
that count is greater than zero (and at least two iterations are left to execute, to account for the work available
for the current worker and at least one of the idle workers), we execute the remaining iterations in parallel. To
do so, ii is set to the number of iterations that have already been executed, and control is transferred back to
the head of the outer while loop (via continue outer), where ii is then used to compute the value of actualn.
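
The control flow of the serial block, including the switch back to parallel execution, can be simulated in a few lines of Python. This is a sequential model under stated assumptions, not generated code: `idle_probe` is a hypothetical stand-in for Runtime.retIdleWorkers(), and the returned plan only lists which index ranges would be run in parallel.

```python
def simulate_dlbc_loop(n, idle_probe):
    """Return (iterations run serially, [(start, end), ...] ranges run in parallel).

    The current worker's share is the last, smallest range of the plan
    (it is dropped from the plan if it is empty).
    """
    serial, ii = [], 0
    workers = idle_probe()
    if workers == 0:
        switched = False
        for i in range(n):
            serial.append(i)                # run iteration i on the current worker
            workers = idle_probe()          # re-check after every iteration
            if workers > 0 and i < n - 2:   # at least two iterations remain
                ii, switched = i + 1, True
                break
        if not switched:
            return serial, []               # the whole loop ran serially
    base, extra = divmod(n - ii, workers + 1)
    sizes = [base + 1] * extra + [base] * (workers + 1 - extra)
    plan, start = [], ii
    for size in sizes:
        if size:
            plan.append((start, start + size))
        start += size
    return serial, plan
```

For instance, with n=10 and a probe that reports 0, 0, 0 and then 2 idle workers, the first three iterations run serially and the remaining seven are split into the ranges (3, 6), (6, 8) and (8, 10).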

3.2.3. Synchronization Operations and DLBC. Our transformation scheme undergoes a small tweak
to handle synchronization operations in the input code. Consider the input code shown in Figure 7(a) and the
code generated by LC in Figure 7(b). The code generated by DLBC is shown in Figure 7(c). Note that the code
generated by the LC substep will always be of the form shown in Figure 7(b), where the async body consists of a
series of serial for-loops separated by Clock.advanceAll statements; the serial for-loop bounds are guarded
by a condition.

Similar to the code shown in Figure 6, the code in Figure 7(c) also contains three distinct blocks: “chunked”,
“parent”, and “serial”. Further, the “chunked block” and the “parent block” have an additional switch
statement each. Consider the scenario when there are no idle workers and the “serial block” is in execution. After
executing all the iterations of S1, we check for the availability of idle workers, and if available we go back to
the “chunked block” to execute the instances of S2. The switch statement in the “chunked block” helps skip
the code that is already executed in the “serial block”. This selection happens using the variable phase, whose
value matches the number of Clock.advanceAll statements executed in the “serial block”. We follow a
similar strategy for generating code for the “parent block”.

Note that in the “serial block” we do not check for the availability of idle workers after the execution of each
instance of S1. This is mainly done to keep the complexity of the generated code and the overheads in check.

3.3. Possible Overheads 

Overheads due to AFE: The code generated by AFE may incur overheads on two accounts. (i)
Reduction in parallelism: Consider the code transformation shown below:

    def f2() {                      def f2() {
      foo(); bar(); }                 finish { foo(); } bar(); }
    def foo() {               =>    def foo() {
      async finish S1 }               async S1 }

It can be seen that the shift of the finish construct from the method foo() to its call site inhibits
the parallel execution of S1 and the call to the function bar (unless the scope of the finish can
be further expanded later to include the call to bar). (ii) Management of a large number of (clocked)
activities by a single finish: The task executing the join operation (finish) performs some
bookkeeping such as collecting all the exceptions, deallocating resources, de-registering the tasks
from the registered clocks (in case of clocked asyncs) and so on. Due to its aggressive nature, AFE
entrusts all the bookkeeping work of many finish operations (that otherwise may have run in
parallel) to one finish operation present in a parent task. This may lead to reduction in parallelism
and performance degradation.

Overheads due to DLBC: DLBC inserts a number of instructions to do load balancing and to
check for the available idle workers. The resulting overheads can offset the gains, especially if these
computations dominate the actual work done by the tasks. In our evaluation, we show that all these
overheads are compensated by the gains resulting from DCAFE.

4. EXTENSIONS FOR EXCEPTIONS 

In this section we extend our proposed techniques to generate semantics preserving code in the
presence of X10 exceptions (see Section 2.1 for a brief introduction). To motivate the impact of exceptions
on the presented mini-transformations, consider Rule 2 of Figure 4 being applied on the following
example, where S1 can throw an exception (of type Ex).
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finish { for(var i:lnt=0; i<n; i++) { 

async clocked(c) { SI; Clock.advanceAll(); S2; } } } 

(a) 

var workers:Int = Runtime.retNthreads(); 
var chunkSize:Int=(n+workers-l)/workers; 

finish { 

for(var ii:lnt=0;ii<n;ii+=chunkSize) { 
val ni = ii; 

async clocked(c) { 

var kx:Int=ni+chunkSize; if {kx>n)kx=n; 
for (var i:Int=ni; i<kx; i++) SI; 

Clock.advanceAll(); 

for(var i:Int=ni; i<kx; i++) S2; } } } 

(b) 

var ii:lnt=0, phase:lnt=0; 

var workers:Int = Runtime.retIdleWorkers(); 
outer: while(true) { 
if(workers>0) { 

val totWorkers:Int = workers+1; 
val actualn:Int = n-ii; 
val eqChunk:Int = actualn/totWorkers; 
val newNrInt = actualn-eqChunk; 
var rem:Int=actualn%totWorkers+workers; 
finish { 

for ( ; ii<newN; ) { //“chunked block” 

val kx:Int=ii+eqChunk+rem/totWorkers; 
val ni=ii; rem—; ii = kx; 

async clocked(c) { 
switch(phase) { 

case 0:for(var i : int=ni;i<kx;i++) SI; 

Clock.advanceAll(); 
case 1:for(var i:int=ni;i<kx;i++) S2; 

} } /* async */ } /* outer-for */ 

switch(phase) { 

case 0:for (var i:Int=newN;i<n;i + + ) 

Clock.advanceAll(); 
case 1:for(var i:Int=newN;i<n;i++) 

} /*parent*/ } /* finish*/ } /*if*l 
else /*workers <= 0*/ { 
for(i=0 ; i<n; i++) SI; 

Clock.advanceAll(); 
workers = Runtime.retIdleWorkers(); 
if (workers>0) { phase++; continue outer; } 
for(i=0;i<n;i++) S2; 

} /* else */ 
break; } /* while */ 

(c) 

Fig. 7: Synchronization operations and chunking, (a) Unoptimized version, (b) LC version, and (c) 
DLBC version. 

    try { S1; finish S2 }        =>      try { finish { S1; S2; } }
    catch(e:Ex) { ... }                  catch(e:Ex) { ... }

In the LHS, the exception thrown by S1 is caught by the catch block. However, in
the RHS, the finish block catches this exception and in turn throws an object of type
MultipleExceptions. Thus, Rule 2 is not semantics preserving in the presence of exceptions.
We now extend our transformation rules to address such challenges.

1. Finish-If Interchange

     if (cond) {
       finish { S1 }<exlist>
     }
   =>
     v = cond;
     finish {
       if (cond) S1 }<exlist>

2. Finish Expansion Upper

     S1;
     finish { S2 }<exlist>
     // ++
     // e-asyncs in S1 do not
     // throw exceptions.
   =>
     var e:Exception=null;
     finish { try { S1 }
              catch(e1:Exception)
              { e = e1; }
              if (e == null) S2
     }<if (e!=null) throw e;
       exlist>

3. Finish Expansion Lower

     finish { S1 }<exlist>
     S2
     // ++
     // e-asyncs in S1 and S2
     // do not throw exceptions.
   =>
     var e:Exception=null;
     finish { S1;
              try { exlist }
              catch(e1:Exception)
              { e = e1; }
              if (e==null) { try { S2 }
                             catch(ex:Exception)
                             { e = ex; } }
     }<if (e!=null) throw e;>

4. Async-Finish Interchange

     async {
       finish { S1 }<> }
     // S1 throws no exceptions.
   =>
     finish {
       async { S1 }
     }<>

5. Try-Finish Exchange

     try {
       finish { S1 }<exlist>
     } catch(e:Ex)
     { S2 }
     // e-asyncs in S1 do not
     // throw exceptions.
   =>
     var e:Ex=null;
     finish { try { try { S1 }
                    catch(e1:Exception)
                    { throw new ME(e1); }
                    exlist
              } catch(e1:Ex) { e=e1; }
     } if (e!=null) { S2 }

Fig. 8: AFE mini-transformation rules in the presence of exceptions.
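The difference between the two sides of the try/finish example can be reproduced with a small Python sketch. It is a toy model, not X10: the `finish` helper is a hypothetical stand-in that mimics only the one property used in the argument, namely that an exception escaping a finish is re-thrown wrapped in a MultipleExceptions object. The names `Ex`, `s1`, `s2`, `lhs` and `rhs` are illustrative assumptions.

```python
class Ex(Exception):
    """Stands in for the exception type Ex thrown by S1."""


class MultipleExceptions(Exception):
    """Toy counterpart of the X10 MultipleExceptions class."""

    def __init__(self, causes):
        super().__init__(f"{len(causes)} exception(s) collected by finish")
        self.causes = list(causes)


def finish(body):
    """Assumed model of finish: run body, wrap any escaping exception."""
    try:
        body()
    except Exception as exc:
        raise MultipleExceptions([exc]) from exc


def s1():
    raise Ex("thrown by S1")


def s2():
    pass


def lhs():
    # try { S1; finish S2 } catch (e:Ex) { ... }
    try:
        s1()
        finish(s2)
    except Ex:
        return "caught Ex"
    return "no exception"


def rhs():
    # try { finish { S1; S2; } } catch (e:Ex) { ... }
    def body():
        s1()
        s2()

    try:
        finish(body)
    except Ex:
        return "caught Ex"
    return "no exception"
```

In this model `lhs()` reaches its handler, while `rhs()` lets a MultipleExceptions object escape because the handler only matches Ex.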

To aid the translation process, we use a temporary finish construct of the
form “finish {S1}<exlist>”, where exlist represents a sequence of conditional
throw statements. Each entry in exlist is of the form “if (ex != null)
throw ex;”. We call exlist the list of pending exceptions. This temporary construct
is translated away at the end of the translation process, using the following rule:

    finish {S1}<exlist>    =>    finish {S1}; exlist;


4.1. AFE in the presence of exceptions 

Figures 8 and 9 present the rules for doing AFE in the presence of exceptions. Here, we use ME
to refer to the X10 MultipleExceptions class. For brevity, we avoid re-stating the conditions of the
old rules (specified in the two earlier figures) and use “// ++” to refer to the same.

1. Loop-Finish Interchange

     for(S1;cond;S2) {
       finish { S3 }<exlist>
     }
     // ++
     // e-asyncs in cond, S2
     // and S3 do not throw
     // exceptions.
   =>
     S1; var e:Exception=null;
     var me:ME=null, v:Boolean;
     finish {
       for(;;) { try { v=cond; }
                 catch(ex:Exception)
                 { e = ex; break; }
         if (e==null && v) {
           try { S3 }
           catch(ex:Exception) {
             me=new ME(ex); break; }
           if (me==null) {
             try { exlist }
             catch(ex:Exception)
             { e = ex; break; }
             if (e==null) {
               try { S2 }
               catch(ex:Exception)
               { e=ex; break; } } } } } }
     <if (e!=null) throw e;
      if (me!=null) throw me;>

2. Finish Fusion

     finish { S1 }<exlist1>
     finish { S2 }<exlist2>
     // ++
     // e-asyncs in S1 and S2
     // do not throw exceptions.
   =>
     finish {
       S1
       exlist1
       S2
     }<exlist2>

3. Tail Finish Elimination

     finish {
       finish { S1 }<exlist1>
     }<exlist2>
   =>
     try { finish { S1 }
           exlist1;
     } catch(e:Exception) {
       val me = new ME(e);
       throw me; }<exlist2>

Fig. 9: Further AFE mini-transformation rules in the presence of exceptions.

Figure 8 presents the modifications to our proposed mini-transformations in the presence
of exceptions. The Finish-If Interchange rule is similar to its exception-free counterpart. Finish
Expansion Upper requires that no exceptions are thrown by the e-asyncs in S1. The transformed
code catches the exception (if any) thrown in S1 and throws the exception outside
the finish. The execution of S2 occurs only if S1 throws no exceptions. Similarly, Finish
Expansion Lower requires that no exceptions are thrown by the e-asyncs of both S1 and S2;
execution of S2 occurs only if S1 and exlist throw no exceptions. Async-Finish Interchange
requires that S1 does not throw exceptions. It also requires that the finish has no pending
exceptions. Besides the extensions to the existing rules in the presence of exceptions,
we need another transformation: Try-Finish Exchange. This transformation requires that no
exceptions are thrown by e-asyncs in S1. For ease of explanation, we explain the modifications
to the Finish-Method Pull transformation using the following example:


     def bar() {
       foo(); }
     def foo() {
       var e:Ex;
       finish S1;
       <if (e != null) throw e;> }
   =>
     var gex:Ex;
     def bar() {
       var e:Ex;
       finish { foo(); e=gex;
       } <if (e!=null) throw e;> }
     def foo() { var e:Ex; S1; gex = e; }


Here we add a new instance field gex that stores the exception (e) occurring inside the method
foo(); the stored exception is then thrown at the call site of foo().
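As an illustration of how an extended rule restores the original behavior, the following Python sketch models the transformed code of Finish Expansion Upper. It is a toy model under stated assumptions, not X10: `finish` is a hypothetical helper that only wraps escaping exceptions in a MultipleExceptions object, the pending-exception list exlist is modeled as a sequence of callables, and `original`, `finish_expansion_upper`, `outcome`, `failing_s1` and `s2` are illustrative names.

```python
class Ex(Exception):
    """Illustrative exception type thrown by S1."""


class MultipleExceptions(Exception):
    """Toy counterpart of the X10 MultipleExceptions class."""

    def __init__(self, causes):
        super().__init__("exceptions collected by finish")
        self.causes = list(causes)


def finish(body):
    """Assumed model of finish: wrap any escaping exception."""
    try:
        body()
    except Exception as exc:
        raise MultipleExceptions([exc]) from exc


def original(s1, s2):
    # S1; finish { S2 }
    s1()
    finish(s2)


def finish_expansion_upper(s1, s2, exlist=()):
    # finish { try { S1 } catch (e1) { e = e1; } if (e == null) S2 }
    # followed by the pending exceptions: if (e != null) throw e; exlist
    e = None

    def body():
        nonlocal e
        try:
            s1()
        except Exception as e1:
            e = e1          # remember the exception instead of letting finish wrap it
        if e is None:
            s2()            # S2 runs only if S1 completed normally

    finish(body)
    if e is not None:
        raise e             # re-thrown outside the finish, so it is not wrapped
    for throw_if_pending in exlist:
        throw_if_pending()


def outcome(run):
    """Name of the exception type escaping run(), or "ok"."""
    try:
        run()
    except Exception as exc:
        return type(exc).__name__
    return "ok"


log = []


def failing_s1():
    raise Ex("thrown by S1")


def s2():
    log.append("S2")
```

In this model both `original(failing_s1, s2)` and `finish_expansion_upper(failing_s1, s2)` let an Ex escape and skip S2, which is the property the rule is meant to preserve.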

Figure 9 presents the extensions for three more mini-transformations in the presence of
exceptions. Rule#1 ensures that S3 is executed only if no exceptions are thrown by cond, S2
and exlist. Rule#2 ensures that S2 is executed only if no exception is thrown in exlist1. Rule#3
uses a try-catch block to capture the exceptions thrown by the inner finish and exlist1, and
rethrows them later.

4.2. DLBC in the presence of exceptions

Note that DLBC invokes LC as its first substep, and LC is semantics preserving in the presence
of exceptions. It can be easily seen that the code introduced by DLBC does not alter the program
semantics (even in the presence of exceptions).


5. EVALUATION 

In this section we evaluate our proposed optimizations: AFE and DLBC. We analyze these
optimizations on two different systems: a 16-core Intel system (2 Intel E5-2670 2.6GHz processors x
8 cores per processor) and a 64-core AMD system (4 AMD Abu Dhabi 6376 processors x 16 cores
per processor).

We implemented AFE and DLBC, as whole-program optimization techniques, in the x10-2.3.0-linux
compiler and present an evaluation of our optimizations using the Native X10 (C++) backend.
Each execution time reading is reported as an average over ten runs. We evaluate our
optimizations on a set of eight RTP kernels (listed in Figure 10), where data parallel loops are the
only means of specifying parallelism. The first five are taken from IMSuite [Gupta and
Nandivada(2015)] and the remaining three are part of the BOTS [Duran et al.(2009)Duran, Teruel,
Ferrer, Martorell, and Ayguadé] benchmark suite. Note that BFS, DST, and MST also have non-clocked
versions in IMSuite, but we chose the clocked versions owing to their added complexity related to
barriers.

Figure 10 (first two columns) provides a brief overview of the benchmarks and their respective
input data sets. For each BOTS benchmark, we list the input type (e.g., Large, Medium) and for
each IMSuite benchmark, we list the input size and note whether we are using the standard input or
a modified one. For all the benchmarks (except DST and MST) we have used one of the standard
inputs provided. For DST and MST we found that the default inputs were not leading to much
recursion (as the diameter of the input graph was around 2 or 3), thereby rendering the program nearly
non-recursive. To overcome this challenge, we used their respective input generators (provided by
IMSuite) to generate larger and denser graphs. In the modified inputs, we cap the maximum number
of neighbors of any node at 40% of the total nodes; the default inputs have no such limit and
thereby generate graphs with very small diameter. For all the benchmarks, the chosen input size was
the largest input such that the corresponding input program takes no more than an hour when run
on our 16-core Intel system.


5.1. Dynamic characteristics 

Figure 10 includes the dynamic characteristics of the benchmarks under consideration. We executed
these kernels on the specified inputs and collected the dynamic counts for the task creation (async)
and task termination (finish) operations. The last two columns of Figure 10 present these
characteristics for the unoptimized (UnOpt), Loop Chunking (LC) and DCAFE (= DLBC+AFE) versions.

It can be seen that in comparison to both the UnOpt and LC versions, DCAFE achieves a significant
reduction in the number of async and finish operations for the BFS, NQ and BY kernels. For
DR, HL and FL there is a significant reduction in the number of async operations, but as AFE is not
able to pull out many of the finish constructs (due to MHBD), a substantial reduction in the number
of finish operations is not achieved. In the case of DST and MST, as the number of finish and
async operations is low (for the UnOpt and LC versions), the reduction in their counts (because of
DCAFE) is also small.


5.2. Comparing DCAFE Vs LC 

Figure 11 compares the performance of DCAFE with respect to LC, for varying number of cores (in
powers of two). Figure 11(a) presents the speedups resulting from DCAFE over the LC policy on
the Intel system; higher is better. We vary the number of cores and X10_NTHREADS from 1 to
16, in sync (i.e., for simulations on a 4-core setup, we set X10_NTHREADS to 4). The performance
improvement for each kernel depends on a varied set of factors: the behavior of the kernel, the
scope for reducing the task creation (async) and the task termination (finish) operations, the
nature of the input, runtime/OS related factors and the hardware characteristics.

Kernel                          Input             Type     #Finish    #Async
Breadth First Search* (BFS)     256 (Standard)    UnOpt    58k        950k
                                                  LC       31k        379k
                                                  DCAFE    1          64
Byzantine (BY)                  128 (Standard)    UnOpt    276k       3869k
                                                  LC       276k       3308k
                                                  DCAFE    34         18k
Dijkstra Routing (DR)           512 (Standard)    UnOpt    28k        631k
                                                  LC       28k        338k
                                                  DCAFE    17k        23k
Breadth First Search* (DST)     2048 (Modified)   UnOpt    3.2k       26k
                                                  LC       3.2k       1k
                                                  DCAFE    18         338
Minimum Spanning Tree* (MST)    512 (Modified)    UnOpt    3.1k       6.3k
                                                  LC       3.1k       2k
                                                  DCAFE    1.1k       1.5k
Nqueens (NQ)                    (Large)           UnOpt    26993k     377901k
                                                  LC       26993k     377901k
                                                  DCAFE    1          3460k
Health (HL)                     (Large)           UnOpt    17516k     630575k
                                                  LC       17516k     210192k
                                                  DCAFE    1636k      2851k
Floorplan (FL)                  (Medium)          UnOpt    3678k      19244k
                                                  LC       3657k      19193k
                                                  DCAFE    3619k      1650k

Fig. 10: Benchmark statistics; starred (*) ones have barriers.

It can be seen that for kernels BFS, DR, NQ and HL, our technique achieves significant speedups
on increasing the number of cores (and thus increasing values of X10_NTHREADS). These speedups
can be attributed to the varied effects of increased parallelism on LC and DCAFE. As the value
of X10_NTHREADS increases, LC creates more tasks at each level. In contrast, DCAFE
creates tasks only if idle workers are available, and thereby is able to take advantage of the increased
number of cores. Thus, DCAFE has comparatively low overheads and synchronization costs, which
improve its relative performance. This is one of the main reasons for the sudden peak in the case of NQ
at 16 cores: the execution time for LC increases sharply due to excessive task creation, while the
DCAFE version maintains its scalable nature (uniform decrease in execution time) as the number
of cores is increased. For HL we observe a dip in its performance on moving from 2 cores to
4 cores. This behavior is not due to any deterioration in the performance of the DCAFE version or
improvement in the performance of the LC version for four cores, but because of the comparatively
lower performance of the LC version at two cores. We hypothesize that this behavior of the LC
generated code is due to system-specific scheduling policies.

A general observation is that when the number of cores is small (1 and 2) the performance gains
for DCAFE are insignificant in comparison to LC. This can be attributed to the fewer opportunities
for expressing parallelism and the smaller value of X10_NTHREADS. For such a setup, both
DCAFE and LC create few tasks at each level. Thus, DCAFE is not able to record significant
task reductions and show gains.

For kernels DST and MST, DCAFE is unable to achieve significant speedups over LC. This
behavior can be attributed to the fewer opportunities for reduction of task creation and termination
operations (number of async and finish operations < 3k, see Figure 10).

FL is an interesting kernel where, at times, DCAFE performs worse than LC. In FL the task
creation occurs inside a doubly nested loop, while the finish construct is outside the nested loops.
Also, the finish construct cannot be eliminated due to dependencies. Importantly, the inner loop
does not spawn enough tasks for the optimization to yield visible gains. Due to these factors, the
DCAFE versions do not have enough scope for improvement, but do more serial work compared to the LC
versions (see Section 3), which in turn affects their comparative performance.

(a) Intel 16-core system; varying runtime configuration Dn, where n = #cores = X10_NTHREADS.
(b) AMD 64-core system; varying runtime configuration Dn, where n = #cores = X10_NTHREADS.
(x-axis of both plots: benchmark kernels BFS, BY, DR, DST, MST, NQ, HL, FL and GeoMean)

Fig. 11: Speedups for varying number of cores; Speedup = (execution time of LC version) /
(execution time of DCAFE version).

In the case of BY, although DCAFE decreases the number of task creation and termination operations
by a good measure, the performance gains are minimal. We find that BY is the only kernel where
the UnOpt version performs better than both LC and DCAFE. This curious behavior results from
the nature of BY and the density of the input. In the case of BY, similar to FL, there is not much
opportunity for loop chunking. Further, and importantly, the work done by the majority of the spawned
tasks in BY is negligible. However, compared to the UnOpt version, the LC version introduces additional
work in each task (to calculate the chunks and so on), and this additional work adds to the time taken
by the LC versions. We can see that DCAFE is actually successful in bridging the gap between the
performance of LC and UnOpt to some extent. This is possible only due to the significant
decrease in the number of finish and async operations. However, as discussed in Section 3, the
overheads of DCAFE offset the overall performance gains.

Figure 11(b) shows the behavior of the eight kernel benchmarks on the 64-core AMD system. In
these plots, we vary the number of cores and X10_NTHREADS from 1 to 64, in sync. On increasing
the cores from 1 to 16, we observe that the performance of the kernels is similar to that of
Figure 11(a), except in the case of HL, where the dip in performance discussed in the context of the
Intel system is not seen. This gives credence to the hypothesis that the dip may be related to some
system-level scheduling issues.

On moving from 16 to 32 to 64 cores, we observe interesting characteristics. For all the kernels
(except DR, DST and MST), there is an increase in execution time (not shown) for all three
versions: UnOpt, LC and DCAFE. This behavior highlights that further gains in execution time
cannot be achieved from these kernel versions (especially for this input) by increasing the number
of cores. Thus, the performance gains achieved by the DCAFE versions (shown in Figure 11(b)) are
due to the large performance degradation of the LC versions, in comparison to DCAFE. For example,
the DCAFE version of FL, which had a slight dip in performance over LC (on 8 and 16 cores), achieves
performance gains (for 32 and 64 cores) due to the large degradation in the execution time of the LC
versions.

For kernels DST and MST, as discussed earlier (for the Intel system), the performance gains are
not substantial due to fewer opportunities for exploiting parallelism. In the case of the DR kernel, we
observe that the DCAFE version performs better on moving from 16 to 32 to 64 cores, but the LC version
does not follow this trend. The LC version suddenly performs better for 32 cores. This leads to the
visible dip in the speedup of DCAFE over LC for 32 cores.

Overall, with respect to the LC versions, the DCAFE versions achieve speedups in the range of
0.1x - 33.34x (geometric mean of 5.75x) on the Intel system, and 1.07x - 22.5x (geometric
mean of 4.16x) on the AMD system.
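The speedup metric of Figure 11 and its geometric-mean summary can be sketched in Python as follows. This is a minimal illustration, not the evaluation scripts used for the paper: the kernel names `K1` to `K3` and all timings are hypothetical placeholders chosen so that the arithmetic is easy to follow, and `speedup` and `geometric_mean` are illustrative names.

```python
import math


def speedup(time_lc, time_dcafe):
    """Speedup of DCAFE over LC: execution time of LC / execution time of DCAFE."""
    return time_lc / time_dcafe


def geometric_mean(values):
    """Geometric mean of positive values, computed in log space."""
    values = list(values)
    if not values or any(v <= 0 for v in values):
        raise ValueError("geometric mean needs a non-empty list of positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical (LC seconds, DCAFE seconds) pairs; not measurements from the paper.
timings = {"K1": (8.0, 2.0), "K2": (9.0, 9.0), "K3": (4.0, 8.0)}

speedups = {name: speedup(lc, dcafe) for name, (lc, dcafe) in timings.items()}
summary = (min(speedups.values()), max(speedups.values()),
           geometric_mean(speedups.values()))
```

A speedup below 1 (as for the placeholder `K3`) means the DCAFE version is slower than the LC version for that kernel; the geometric mean keeps one such kernel from being hidden or exaggerated by large speedups elsewhere.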

5.3. Performance evaluation of all the proposed techniques 

We now compare the performance of Serial, UnOpt+AFE, LC, LC+AFE, DLBC and DCAFE, with
respect to UnOpt, in Figure 12. For brevity, we evaluate the kernels only for the largest set of
hardware cores (i.e., 16 cores on the Intel system and 64 cores on AMD), and X10_NTHREADS is set to
#cores. All the results are normalized with respect to the execution times for the UnOpt versions.

It can be seen that AFE does not reduce the number of redundant finish constructs for kernels
DR, HL and FL, and hence AFE has no effect (as shown by the numbers of (i) DCAFE Vs DLBC,
(ii) LC Vs LC+AFE, and (iii) UnOpt Vs UnOpt+AFE). It can be seen that DLBC and LC+AFE
perform better than LC (most of the time), in the context of RTP programs. The exact performance
improvement may differ from one kernel to another (depending on the amount of available
parallelism). These two techniques, when used in conjunction (as DCAFE), perform significantly better
than all the other presented techniques.

For DST, MST and FL, as mentioned earlier, the performance improvement may not be significant, 
and can rather have a slight dip, as there is limited scope for task reduction. 

In the case of BFS, we see a significant drop in the performance of LC+AFE on the Intel system, but
the plot for the AMD system does not show such a dip. We ran the same benchmark on the AMD
system for 16 cores and found that the LC+AFE version showed a similar behavior. We observed
that as the number of cores increases, the performance of the LC+AFE version of BFS improves.

Considering the impact of UnOpt+AFE, it can be seen that AFE alone is unable to achieve
much performance difference, even in the kernels where AFE leads to a reduction in the number of
finish operations. This is due to the overheads arising out of the increased bookkeeping activities
(see Section 3), which neutralize the gains.

For BFS, HL and MST, the UnOpt versions perform worse than the Serial version, because of the
overheads due to parallelization (such as the cost of task creation, task termination and barriers).
However, DCAFE is able to reduce these overheads and realize gains. HL shows an interesting scenario,
where DCAFE performs better than Serial in Figure 12(a) but performs poorly in Figure 12(b). On
further investigation we found that the DCAFE version of HL actually performs better than the
serial version when it is run on 8 and 16 cores (on the AMD system). This is consistent with the
performance of DCAFE shown in Figure 11(b). A similar reason holds for NQ, where the Serial version
performs better than the UnOpt version in Figure 12(b) but not in Figure 12(a).


Overall, it can be seen that compared to DLBC, AFE reaps smaller performance improvements. But
we argue that its impact cannot be ignored. Skipping the benchmarks (DR, HL and FL) where AFE
did not do any transformation, it can be seen that the impact of AFE is between 1.8% and 45.9%,
which we believe is significant.

(a) Intel 16-core system, configuration: #cores = X10_NTHREADS = 16.
(b) AMD 64-core system, static configuration: #cores = X10_NTHREADS = 64.

Fig. 12: Comparison of different schemes with respect to UnOpt. Performance of scheme X with
respect to UnOpt = (execution time of UnOpt version) / (execution time of X).

To summarize: with respect to UnOpt, our techniques LC+AFE, DLBC and DCAFE achieve
speedups (geometric mean) of 1.31x, 12.28x and 12.64x, respectively, on the Intel system; in
comparison, LC achieves a speedup of only 2.2x. Similarly, on the AMD system, LC+AFE, DLBC and
DCAFE achieve speedups (geometric mean) of 1.02x, 4.29x and 5.25x, respectively, over the UnOpt
version; in comparison, LC achieves a speedup of only 1.01x over UnOpt.


5.4. Energy Consumption 

We now discuss the effect of DCAFE and LC on the energy consumption of the benchmark kernels
(on the Intel system). We implemented a function read_msr that uses the Intel Running Average
Power Limit (RAPL) [Intel(2014)] interface to read the energy consumption of all the cores of a
node. We modify the compiler to emit a call to this function before and after the execution phase, to
calculate the energy difference. We could not find a similar interface for our AMD system.

Fig. 13: Energy consumption normalized to UnOpt.


Figure 13 depicts the energy consumed by the LC and DCAFE versions of the eight kernel
benchmarks. All the results are normalized to their UnOpt counterparts. It can be seen that for most of
the benchmarks, both the LC and DCAFE versions show a reduction in energy consumption. However,
the reduction due to DCAFE is much more than that resulting from LC. Overall, it can be seen that
compared to the UnOpt versions, the DCAFE versions consume energy in the range of 0.007x -
1.105x (geometric mean of 0.19x), while the LC versions consume energy in the range of 0.284x
- 1.554x (geometric mean of 0.658x). Overall, compared to the LC versions, the DCAFE versions
consume energy in the range of 0.026x - 0.999x (geometric mean 0.288x). Thus, on average the
DCAFE versions consume 71.2% less energy than the LC counterparts.
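The two computations behind these numbers, the before/after counter difference and the conversion of a normalized energy value into a percentage saving, can be sketched in Python as follows. This is a minimal sketch under stated assumptions and not the read_msr function of the paper: it assumes a raw energy counter that is 32 bits wide and wraps around, and a known joules-per-count unit. The names `read_counter`, `measure_energy`, `energy_delta` and `percent_saving` are hypothetical.

```python
COUNTER_BITS = 32  # assumption: the raw energy counter wraps at 2**32


def energy_delta(before, after, joules_per_count, bits=COUNTER_BITS):
    """Energy in joules between two raw readings, tolerating one wrap-around."""
    counts = (after - before) % (1 << bits)
    return counts * joules_per_count


def measure_energy(read_counter, run, joules_per_count):
    """Read the counter before and after run(); return (result, joules)."""
    before = read_counter()
    result = run()
    after = read_counter()
    return result, energy_delta(before, after, joules_per_count)


def percent_saving(normalized_energy):
    """Percentage saved when energy is normalized to a baseline of 1.0."""
    return (1.0 - normalized_energy) * 100.0


# Geometric mean of DCAFE energy normalized to LC, as reported in the text.
saving_vs_lc = round(percent_saving(0.288), 1)
```

With the reported geometric mean of 0.288x, `saving_vs_lc` evaluates to the 71.2% figure quoted in the text.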

We observe that the maximum energy savings are achieved for kernels BFS, DR, NQ and HL. These
savings directly follow from the significant reduction in the execution time, which in turn is due to the
reduction in task creation and termination operations for these kernels. On the other hand, for DST,
MST and FL, there is no significant reduction in the energy consumption, which can be attributed
to the fewer task reduction opportunities available in these kernels. In the case of BY, compared to the
UnOpt version, the energy consumption of both the DCAFE and LC versions is higher (following the
trend of the execution time). However, it can be seen that DCAFE reduces the energy overheads of
LC to a large extent.

6. DISCUSSION 

In this section we discuss some general aspects of our proposed optimizations, their scope
and alternatives.

Non-triviality of DLBC: To optimize RTP programs with loops using low-level synchronization
primitives (like X10 clocks), DLBC includes many non-trivial extensions to LC. These include i)
the scheme of executing the loop serially and doing so for a subset of iterations, before proceeding
to create parallel tasks to execute the rest of the iterations; and ii) conditionally executing the
loop in parallel and ensuring that the parent worker does some useful work, besides waiting for
the other threads to join. These proposed extensions give rise to many interesting design choices:
i) how/when to switch between serial and parallel codes, ii) the procedure to compute the chunking
factor, iii) the procedure to identify the count of idle worker threads, and so on. Besides the particular
design choices described in Section 3, we tested many other alternatives and finally zeroed in on
the most profitable ones. Some of the choices we tested are listed below for pedagogy.

(a) Static cut-off based on the recursion depth: This scheme stops creating new parallel tasks once
the depth of the recursion crosses a certain static cut-off value (such as 2, 3, 4, and 5). Thus in this
scheme, based on the cut-off, we keep creating parallel tasks even if there are no free workers.
Similarly, even if there are free workers, we do not create new parallel tasks after the pre-specified
recursion depth. Pros: Simple to implement. Cons: Hard to predict the optimal cut-off and minimize
the overheads. Our conclusion after experimentation: Overall inefficient and impractical.

(b) Trade-offs in the serial block: To avoid checking for available parallel workers after each serial
iteration, we tried the strategy of checking for available workers only after a fixed number of serial
iterations (e.g., 2, 3, 4). The main intuition was to allow parallel execution when there is a sufficient
number of workers. Pros: Reduces the overhead of checking for available workers and waiting for a
sufficient number of workers. Cons: May miss some chances to parallelize some iterations. Our
conclusion after experimentation: The complexity of the additional checks did not pay off.

(c) Minimum number of parallel tasks instead of complete serialization: DLBC turns to serial code
when there are no available free workers. We tried a scheme where, instead of executing the loop in
serial, we divided it into two chunks: one chunk is executed as part of the current task, and the second
one is executed by a new parallel task. Pros: Chances of workers remaining free will be small. Cons: May
end up creating more tasks than required. Our conclusion after experimentation: The cons
outweighed the pros.
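The chunk-boundary arithmetic of the LC and DLBC versions shown in Figure 7(b) and 7(c) can be sketched in Python as follows. This is an illustrative model of the index computations only (no tasks, clocks or phases are modeled): it assumes the DLBC loop index ii is 0 on entry, it collapses the serial block into a single chunk executed by the current task, and the function names `lc_chunks` and `dlbc_chunks` are hypothetical.

```python
def lc_chunks(n, workers):
    """Half-open chunks [lo, hi) created by the LC version of Figure 7(b)."""
    if n <= 0:
        return []
    chunk_size = (n + workers - 1) // workers   # ceiling of n / workers
    return [(lo, min(lo + chunk_size, n)) for lo in range(0, n, chunk_size)]


def dlbc_chunks(n, idle_workers):
    """Split iterations 0..n-1 as in the chunked block of Figure 7(c).

    Returns (async_chunks, parent_chunk). Assumes ii == 0 on entry.
    """
    if idle_workers <= 0:
        # Simplified serial block: the current task runs every iteration.
        return [], (0, n)
    tot_workers = idle_workers + 1        # idle workers plus the parent task
    eq_chunk = n // tot_workers           # equal share kept by the parent
    new_n = n - eq_chunk                  # iterations [0, new_n) go to asyncs
    rem = n % tot_workers + idle_workers  # spreads the remainder, one per chunk
    chunks, ii = [], 0
    while ii < new_n:
        kx = ii + eq_chunk + rem // tot_workers
        chunks.append((ii, kx))
        rem -= 1
        ii = kx
    return chunks, (new_n, n)
```

For example, with 10 iterations and 3 idle workers this model yields the async chunks (0,3), (3,6), (6,8) and leaves (8,10) to the parent, so the number of chunks follows the number of workers that are idle at that moment rather than the total number of worker threads.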

Runtime Optimizations: AFE involves elaborate dependence analysis and code transformation
schemes that are non-local in nature (even in the absence of exceptions). Re-casting AFE as a
runtime optimization may seem attractive, but is non-trivial and can be expensive. Similarly,
DLBC requires generation of serial code from the input parallel code. This process is non-trivial,
especially in the presence of deeply nested barriers such as clocks. Both AFE and DLBC are
whole-program optimizations that have intuitive compile-time implementations and reap runtime benefits.


Scope of AFE and DLBC: AFE and DLBC are not restricted to X10 and can be applied
to other task-parallel languages with similar constructs such as HJ (async/finish) and Chapel
(begin/sync). Further, DLBC can also be used in other task parallel languages such as Cilk
[Leiserson(2009)] and OpenMP [OpenMP(2008)].


7. RELATED WORK 


There have been several works [Cytron et al.(1990)Cytron, Lipkis, and Schonberg] [Heinz and
Philippsen(1993)] [Tseng(1995)] [Ferrer et al.(2010)Ferrer, Duran, Martorell, and Ayguadé] [Noll and
Gross(2012)] [Nandivada et al.(2013)Nandivada, Shirako, Zhao, and Sarkar] that aim to reduce the
overheads resulting from useless synchronization and join operations. Cytron et al. propose
reduction of synchronization constructs by translating input fork-join code to SPMD code with a
reduced number of barriers. Heinz and Philippsen perform source-to-source transformations to reduce
the barrier synchronization operations in data parallel programs. Their optimizations target the
redundant synchronization operations present in synchronous FORALL statements by converting
them into simplified asynchronous FORALL statements with reduced synchronization overheads.
Tseng extends the work of Cytron et al. by using a combined fork-join and SPMD model to reduce
synchronization overheads. Ferrer et al. exploit the loop unrolling transformation in the presence
of task parallel constructs. The authors try to aggregate multiple fine-grained tasks (by unrolling
loops) into larger ones to achieve performance. Noll and Gross propose task reduction and
synchronization optimizations for JIT compilers. The authors propose an optimization that allows
merging of small concurrent tasks into a large task. Compared to these, our optimizations eliminate
redundant task creation and termination operations in recursive task parallel programs. Further, we
present a scheme to do the transformations in a semantics-preserving manner, even in the presence
of exceptions.

Yonezawa et al. [Yonezawa et al.(2006)Yonezawa, Wada, and Aida] aim at reducing the barrier
synchronization operations by generating efficient communication code for data transfer operations
in a distributed application. Similarly, Bikshandi et al. [Bikshandi et al.(2009)Bikshandi, Castanos,
Kodali, Nandivada, Peshansky, Saraswat, Sur, Varma, and Wen] propose methods to efficiently
execute outer-most finish operations. Nagarajan and Gupta [Nagarajan and Gupta(2010)] use
speculative execution to reduce the overheads associated with barriers. We believe that these techniques
can be used in conjunction with our proposed AFE to further increase the performance gains.

Nicolau et al. [Nicolau et al.(2009)Nicolau, Li, Veidenbaum, and Kejariwal] propose
optimizations (via code percolation) to reduce redundant synchronization operations such as post and
wait. In contrast, we present techniques to expand the scope of finish operations to
reduce the number of finish operations, especially in the context of recursive task parallel
programs.

Our work is most closely related to the work of Nandivada et al. [Nandivada
et al.(2013)Nandivada, Shirako, Zhao, and Sarkar], who present a framework to reduce task
creation, synchronization and termination operations. They specify a set of three techniques (finish
elimination, forall coarsening and (static) loop chunking) that generate efficient code for task
parallel programs. Compared to their approach, we present an approach to do efficient loop chunking
(dynamic) and aggressive finish elimination in the context of recursive task parallel programs.

Narayanan et al. [Narayanan et al.(2005)Narayanan, Chen, Kandemir, and Xie] use classical loop
chunking to generate power-efficient code. Their transformation distributes equal chunks of
iterations on different processors. To the best of our knowledge, ours is the first paper that studies the
impact of reduction in task creation and termination operations on the energy consumed.

Loop scheduling [Kennedy and Allen(2002)] has been one of the most popular techniques to
efficiently execute loop nests. Popular loop scheduling schemes include static (all the iterations
are divided equally among the declared workers), dynamic (the iterations are divided into many
small chunks and added to a work queue, and each free worker takes a chunk from this queue to
execute), and guided (similar to dynamic, but the chunk size varies dynamically).
Our proposed DLBC method can be seen as a specialization of loop scheduling where i) iterations
scheduled to be executed by the same processor are executed sequentially, and ii) some iterations
of the parallel loop may be executed sequentially before the rest of the loop iterations are
divided among the available workers.
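
The three classical schemes can be illustrated by how they carve an iteration space into chunks.
The following Python sketch is illustrative only: the function names and the `min_chunk`
parameter are our own assumptions, and the guided rule shown (chunk size = ceiling of remaining
iterations / workers) is one common formulation rather than the only one.

```python
def static_chunks(n_iters, n_workers):
    """Static: split [0, n_iters) into n_workers near-equal contiguous ranges."""
    base, extra = divmod(n_iters, n_workers)
    chunks, start = [], 0
    for w in range(n_workers):
        size = base + (1 if w < extra else 0)  # first `extra` workers get one more
        chunks.append((start, start + size))
        start += size
    return chunks


def dynamic_chunks(n_iters, chunk_size):
    """Dynamic: fixed-size chunks that would be placed on a shared work queue."""
    return [(s, min(s + chunk_size, n_iters)) for s in range(0, n_iters, chunk_size)]


def guided_chunks(n_iters, n_workers, min_chunk=1):
    """Guided: each chunk is ceil(remaining / n_workers), so chunks shrink over time."""
    chunks, start = [], 0
    while start < n_iters:
        remaining = n_iters - start
        size = max(min_chunk, -(-remaining // n_workers))  # ceiling division
        size = min(size, remaining)
        chunks.append((start, start + size))
        start += size
    return chunks
```

For 10 iterations and 4 workers, the static scheme yields four ranges of sizes 3, 3, 2, and 2,
whereas the guided scheme starts with a chunk of size 3 and tapers down to chunks of size 1.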

There have been many works [Wilson et al.(1994)Wilson, French, Wilson, Amarasinghe, Anderson,
Tjiang, Liao, Tseng, Hall, Lam, and Hennessy; Hall and Martonosi(1998); Yue and Lilja(1996)]
that compute and assign the optimal number of processors / workers to execute a given loop nest
and parallelize the loop accordingly. In contrast, we use a simple scheme of chunking parallel loops
based on the number of available worker threads (number of chunks = number of available
workers). It would be interesting to extend our proposed DLBC with more sophisticated mechanisms to
compute the optimal number of worker threads.
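
The rule "number of chunks = number of available workers" can be sketched as follows. This is a
minimal Python illustration under stated assumptions, not the DLBC code generated by our compiler:
`run_chunked` is a hypothetical helper, and `idle_workers` stands in for a runtime query of the
number of free worker threads, which is supplied here as a plain argument.

```python
import threading


def run_chunked(body, lo, hi, idle_workers):
    """Run body(i) for i in [lo, hi); return the number of tasks spawned."""
    n = hi - lo
    if n <= 0:
        return 0
    if idle_workers <= 0:
        # No free worker: the caller executes every iteration itself.
        for i in range(lo, hi):
            body(i)
        return 0

    def run_range(start, end):
        # Iterations within one chunk run sequentially on one worker.
        for i in range(start, end):
            body(i)

    n_chunks = min(idle_workers, n)  # assumed rule: one chunk per idle worker
    base, extra = divmod(n, n_chunks)
    tasks, start = [], lo
    for c in range(n_chunks):
        end = start + base + (1 if c < extra else 0)
        task = threading.Thread(target=run_range, args=(start, end))
        task.start()
        tasks.append(task)
        start = end
    for task in tasks:  # a single join point for all chunks
        task.join()
    return n_chunks
```

With 10 iterations and 3 idle workers, the sketch creates 3 tasks instead of 10; with no idle
workers it creates none and the loop runs serially.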

Voss and Eigenmann [Voss and Eigenmann(1999)] proposed an inspector-executor model that decides at
runtime whether to execute a loop in parallel or serially. The main idea behind this scheme is
that the benefits of executing a loop in parallel may be outweighed if the overheads of parallel
execution are significant. The authors first run a loop in parallel and measure its execution
time. They then compare the result with the measured time of the serial version of the loop
and decide whether to run subsequent instances of this loop in parallel or not.
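
The decision logic of such a scheme can be sketched in a few lines. The Python code below is a
simplified illustration of the idea as summarized above, not the implementation of Voss and
Eigenmann: the class name `AdaptiveLoop`, its interface, and the policy of timing exactly one
parallel and one serial run are our own assumptions.

```python
import time


class AdaptiveLoop:
    """Time one parallel and one serial run, then keep using the faster version."""

    def __init__(self, serial_fn, parallel_fn, timer=time.perf_counter):
        self.versions = {"parallel": parallel_fn, "serial": serial_fn}
        self.timer = timer  # injectable clock, which makes the policy testable
        self.times = {}     # measured execution time per version

    def _timed(self, mode, *args):
        start = self.timer()
        result = self.versions[mode](*args)
        self.times[mode] = self.timer() - start
        return result

    def run(self, *args):
        # Inspector phase: the first call times the parallel version,
        # the second call times the serial version.
        for mode in ("parallel", "serial"):
            if mode not in self.times:
                return self._timed(mode, *args)
        # Executor phase: all later calls use whichever version was faster.
        best = min(self.times, key=self.times.get)
        return self.versions[best](*args)
```

A production scheme would presumably re-measure periodically, since the relative cost of the two
versions can change with the input size and the system load.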

There have been several prior works that control the parallelism based on different kinds of
thresholds (all measured at runtime). For non-RTP programs, some of the popular thresholds are
system load [Kranz et al.(1989)Kranz, Halstead, and Mohr; Certner et al.(2008)Certner, Li, Palatin,
Temam, Arzel, and Drach], the size of the data structures [Huelsbergen et al.(1994)Huelsbergen,
Larus, and Aiken; Aharoni et al.(1992)Aharoni, Feitelson, and Barak], which gives an estimate of
the time the code to be parallelized may take to execute, and the profile-based estimated workload
in different iterations [Prechelt and Hanssgen(2002)]. For RTP programs, Duran et al. [Duran
et al.(2009)Duran, Teruel, Ferrer, Martorell, and Ayguade] show the use of a static value of
recursion depth as a cut-off for parallelization. Similarly, dynamic cut-offs based on runtime
parameters [Duran et al.(2008)Duran, Corbalan, and Ayguade] have also been used for RTP programs.
Considering the difficulties in statically determining the appropriate recursion depth, and the
overheads in the dynamic approach of Duran et al. [Duran et al.(2008)Duran, Corbalan, and Ayguade]
(which requires additional monitoring threads), we propose a scheme to determine the number of
parallel tasks based on the number of available free workers.
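
For illustration, a static recursion-depth cut-off can be sketched as below. This Python code is
a generic example of the cut-off idea, not the scheme of Duran et al. nor our own: the Fibonacci
kernel, the `max_depth` parameter, and the `spawned` list used to count tasks are assumptions
made for the example.

```python
import threading


def fib_serial(n):
    return n if n < 2 else fib_serial(n - 1) + fib_serial(n - 2)


def fib_cutoff(n, max_depth, depth=0, spawned=None):
    """Spawn a task per recursive call only while depth < max_depth."""
    if n < 2:
        return n
    if depth >= max_depth:
        # Past the cut-off: no more task creation, run sequentially.
        return fib_serial(n)

    result = {}

    def child():
        result["left"] = fib_cutoff(n - 1, max_depth, depth + 1, spawned)

    task = threading.Thread(target=child)
    if spawned is not None:
        spawned.append(depth)  # list.append is atomic in CPython
    task.start()
    # The parent computes the other branch itself, then joins the child.
    right = fib_cutoff(n - 2, max_depth, depth + 1, spawned)
    task.join()
    return result["left"] + right
```

The difficulty noted above is visible in the sketch: the number of tasks is fixed by `max_depth`
alone (3 tasks for `max_depth=2` when `n` is large enough), regardless of how many workers are
actually free at runtime.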

Our idea of task creation based on worker availability and the “serial block” (in DLBC) can be seen
as a compiler-based extension of the lazy binary splitting (LBS) scheme [Tzannes et al.(2010)Tzannes,
Caragea, Barua, and Vishkin] for RTP programs and programs with synchronization operations. It
would be interesting to evaluate the effect of DCAFE on an LBS-based runtime scheduler.


8. CONCLUSION 

In this paper, we present two new optimizations, AFE (“Aggressive Finish Elimination”) and DLBC
(“Dynamic Load-Balanced loop Chunking”), to reduce the task creation and termination overheads
in recursive task parallel (RTP) programs. These optimizations improve performance, both in
terms of execution time and energy consumption. We implemented DCAFE (= DLBC+AFE) in the
X10v2.3 compiler and performed experiments on two different hardware systems (a 16-core Intel
system and a 64-core AMD system). Compared to the loop chunking scheme of Nandivada et al.
[Nandivada et al.(2013)Nandivada, Shirako, Zhao, and Sarkar], DCAFE achieved significant
improvements in execution time (geometric mean of 5.75x and 4.16x on the Intel and AMD systems,
respectively) and a substantial reduction in energy consumption (geometric mean of 71.2% on the
Intel system). These improvements in execution time and energy consumption
attest to the scope of the proposed optimizations. Though our results are shown in the context of
X10, we believe that our proposed optimizations can be applied (with similar effect) to other task
parallel languages like OpenMP, Chapel, and HJ that admit RTP programs.